Predicting Online News Article Popularity


Introduction

The aim of this project is to explore a dataset in depth, apply a business analytics mindset to implement appropriate predictive analytics, and communicate the findings effectively.

The dataset comprises statistical measures of online news articles. This analysis builds a machine learning system on the data to predict the popularity of online news articles, with the goal of using the system to configure and present future articles so that they sell more advertising.

Dataset

The dataset comes from the UCI Machine Learning Repository: Online News Popularity Data Set. It summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares in social networks (popularity).

Data Description

Attribute Information:

 0. url:                           URL of the article
 1. timedelta:                     Days between the article publication and the dataset acquisition
 2. n_tokens_title:                Number of words in the title
 3. n_tokens_content:              Number of words in the content
 4. n_unique_tokens:               Rate of unique words in the content
 5. n_non_stop_words:              Rate of non-stop words in the content
 6. n_non_stop_unique_tokens:      Rate of unique non-stop words in the content
 7. num_hrefs:                     Number of links
 8. num_self_hrefs:                Number of links to other articles published by Mashable
 9. num_imgs:                      Number of images
10. num_videos:                    Number of videos
11. average_token_length:          Average length of the words in the content
12. num_keywords:                  Number of keywords in the metadata
13. data_channel_is_lifestyle:     Is data channel 'Lifestyle'?
14. data_channel_is_entertainment: Is data channel 'Entertainment'?
15. data_channel_is_bus:           Is data channel 'Business'?
16. data_channel_is_socmed:        Is data channel 'Social Media'?
17. data_channel_is_tech:          Is data channel 'Tech'?
18. data_channel_is_world:         Is data channel 'World'?
19. kw_min_min:                    Worst keyword (min. shares)
20. kw_max_min:                    Worst keyword (max. shares)
21. kw_avg_min:                    Worst keyword (avg. shares)
22. kw_min_max:                    Best keyword (min. shares)
23. kw_max_max:                    Best keyword (max. shares)
24. kw_avg_max:                    Best keyword (avg. shares)
25. kw_min_avg:                    Avg. keyword (min. shares)
26. kw_max_avg:                    Avg. keyword (max. shares)
27. kw_avg_avg:                    Avg. keyword (avg. shares)
28. self_reference_min_shares:     Min. shares of referenced articles in Mashable
29. self_reference_max_shares:     Max. shares of referenced articles in Mashable
30. self_reference_avg_sharess:    Avg. shares of referenced articles in Mashable
31. weekday_is_monday:             Was the article published on a Monday?
32. weekday_is_tuesday:            Was the article published on a Tuesday?
33. weekday_is_wednesday:          Was the article published on a Wednesday?
34. weekday_is_thursday:           Was the article published on a Thursday?
35. weekday_is_friday:             Was the article published on a Friday?
36. weekday_is_saturday:           Was the article published on a Saturday?
37. weekday_is_sunday:             Was the article published on a Sunday?
38. is_weekend:                    Was the article published on the weekend?
39. LDA_00:                        Closeness to LDA topic 0
40. LDA_01:                        Closeness to LDA topic 1
41. LDA_02:                        Closeness to LDA topic 2
42. LDA_03:                        Closeness to LDA topic 3
43. LDA_04:                        Closeness to LDA topic 4
44. global_subjectivity:           Text subjectivity
45. global_sentiment_polarity:     Text sentiment polarity
46. global_rate_positive_words:    Rate of positive words in the content
47. global_rate_negative_words:    Rate of negative words in the content
48. rate_positive_words:           Rate of positive words among non-neutral tokens
49. rate_negative_words:           Rate of negative words among non-neutral tokens
50. avg_positive_polarity:         Avg. polarity of positive words
51. min_positive_polarity:         Min. polarity of positive words
52. max_positive_polarity:         Max. polarity of positive words
53. avg_negative_polarity:         Avg. polarity of negative  words
54. min_negative_polarity:         Min. polarity of negative  words
55. max_negative_polarity:         Max. polarity of negative  words
56. title_subjectivity:            Title subjectivity
57. title_sentiment_polarity:      Title polarity
58. abs_title_subjectivity:        Absolute subjectivity level
59. abs_title_sentiment_polarity:  Absolute polarity level
60. shares:                        Number of shares (target)

Importing the data

The data is imported into R and comprises 61 features. The 61st feature, shares, is the target variable: an article is considered eligible for publishing, i.e. good enough to sell advertising, if its number of shares exceeds 1400.

##  [1] "url"                           "timedelta"                    
##  [3] "n_tokens_title"                "n_tokens_content"             
##  [5] "n_unique_tokens"               "n_non_stop_words"             
##  [7] "n_non_stop_unique_tokens"      "num_hrefs"                    
##  [9] "num_self_hrefs"                "num_imgs"                     
## [11] "num_videos"                    "average_token_length"         
## [13] "num_keywords"                  "data_channel_is_lifestyle"    
## [15] "data_channel_is_entertainment" "data_channel_is_bus"          
## [17] "data_channel_is_socmed"        "data_channel_is_tech"         
## [19] "data_channel_is_world"         "kw_min_min"                   
## [21] "kw_max_min"                    "kw_avg_min"                   
## [23] "kw_min_max"                    "kw_max_max"                   
## [25] "kw_avg_max"                    "kw_min_avg"                   
## [27] "kw_max_avg"                    "kw_avg_avg"                   
## [29] "self_reference_min_shares"     "self_reference_max_shares"    
## [31] "self_reference_avg_sharess"    "weekday_is_monday"            
## [33] "weekday_is_tuesday"            "weekday_is_wednesday"         
## [35] "weekday_is_thursday"           "weekday_is_friday"            
## [37] "weekday_is_saturday"           "weekday_is_sunday"            
## [39] "is_weekend"                    "LDA_00"                       
## [41] "LDA_01"                        "LDA_02"                       
## [43] "LDA_03"                        "LDA_04"                       
## [45] "global_subjectivity"           "global_sentiment_polarity"    
## [47] "global_rate_positive_words"    "global_rate_negative_words"   
## [49] "rate_positive_words"           "rate_negative_words"          
## [51] "avg_positive_polarity"         "min_positive_polarity"        
## [53] "max_positive_polarity"         "avg_negative_polarity"        
## [55] "min_negative_polarity"         "max_negative_polarity"        
## [57] "title_subjectivity"            "title_sentiment_polarity"     
## [59] "abs_title_subjectivity"        "abs_title_sentiment_polarity" 
## [61] "shares"
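The import step can be sketched roughly as follows (a minimal example; the file name OnlineNewsPopularity.csv and the data frame name news are assumptions based on the UCI archive and the str() output shown later):

```r
# Read the UCI CSV; strip.white trims the padded column names in the raw file
news <- read.csv("OnlineNewsPopularity.csv", strip.white = TRUE,
                 stringsAsFactors = FALSE)
names(news)  # the 61 feature names listed above
dim(news)    # 39644 rows, 61 columns
```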

Explore and Clean the dataset

To find which of the remaining 60 features are best for predicting shares, we must inspect the data and look for patterns through summaries and plots.

Exploration: Distributions of features

Let's check the distribution of values each column takes. To do so we need to restrict attention to the columns with numeric values; as the result of str(news) below shows, every column except url is numeric.

## 'data.frame':    39644 obs. of  61 variables:
##  $ url                          : chr  "http://mashable.com/2013/01/07/amazon-instant-video-browser/" "http://mashable.com/2013/01/07/ap-samsung-sponsored-tweets/" "http://mashable.com/2013/01/07/apple-40-billion-app-downloads/" "http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/" ...
##  $ timedelta                    : num  731 731 731 731 731 731 731 731 731 731 ...
##  $ n_tokens_title               : num  12 9 9 9 13 10 8 12 11 10 ...
##  $ n_tokens_content             : num  219 255 211 531 1072 ...
##  $ n_unique_tokens              : num  0.664 0.605 0.575 0.504 0.416 ...
##  $ n_non_stop_words             : num  1 1 1 1 1 ...
##  $ n_non_stop_unique_tokens     : num  0.815 0.792 0.664 0.666 0.541 ...
##  $ num_hrefs                    : num  4 3 3 9 19 2 21 20 2 4 ...
##  $ num_self_hrefs               : num  2 1 1 0 19 2 20 20 0 1 ...
##  $ num_imgs                     : num  1 1 1 1 20 0 20 20 0 1 ...
##  $ num_videos                   : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ average_token_length         : num  4.68 4.91 4.39 4.4 4.68 ...
##  $ num_keywords                 : num  5 4 6 7 7 9 10 9 7 5 ...
##  $ data_channel_is_lifestyle    : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ data_channel_is_entertainment: num  1 0 0 1 0 0 0 0 0 0 ...
##  $ data_channel_is_bus          : num  0 1 1 0 0 0 0 0 0 0 ...
##  $ data_channel_is_socmed       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ data_channel_is_tech         : num  0 0 0 0 1 1 0 1 1 0 ...
##  $ data_channel_is_world        : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ kw_min_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_min                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_min_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_max                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_min_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_max_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ kw_avg_avg                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ self_reference_min_shares    : num  496 0 918 0 545 8500 545 545 0 0 ...
##  $ self_reference_max_shares    : num  496 0 918 0 16000 8500 16000 16000 0 0 ...
##  $ self_reference_avg_sharess   : num  496 0 918 0 3151 ...
##  $ weekday_is_monday            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_tuesday           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_wednesday         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_thursday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_friday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_saturday          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday_is_sunday            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_weekend                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LDA_00                       : num  0.5003 0.7998 0.2178 0.0286 0.0286 ...
##  $ LDA_01                       : num  0.3783 0.05 0.0333 0.4193 0.0288 ...
##  $ LDA_02                       : num  0.04 0.0501 0.0334 0.4947 0.0286 ...
##  $ LDA_03                       : num  0.0413 0.0501 0.0333 0.0289 0.0286 ...
##  $ LDA_04                       : num  0.0401 0.05 0.6822 0.0286 0.8854 ...
##  $ global_subjectivity          : num  0.522 0.341 0.702 0.43 0.514 ...
##  $ global_sentiment_polarity    : num  0.0926 0.1489 0.3233 0.1007 0.281 ...
##  $ global_rate_positive_words   : num  0.0457 0.0431 0.0569 0.0414 0.0746 ...
##  $ global_rate_negative_words   : num  0.0137 0.01569 0.00948 0.02072 0.01213 ...
##  $ rate_positive_words          : num  0.769 0.733 0.857 0.667 0.86 ...
##  $ rate_negative_words          : num  0.231 0.267 0.143 0.333 0.14 ...
##  $ avg_positive_polarity        : num  0.379 0.287 0.496 0.386 0.411 ...
##  $ min_positive_polarity        : num  0.1 0.0333 0.1 0.1364 0.0333 ...
##  $ max_positive_polarity        : num  0.7 0.7 1 0.8 1 0.6 1 1 0.8 0.5 ...
##  $ avg_negative_polarity        : num  -0.35 -0.119 -0.467 -0.37 -0.22 ...
##  $ min_negative_polarity        : num  -0.6 -0.125 -0.8 -0.6 -0.5 -0.4 -0.5 -0.5 -0.125 -0.5 ...
##  $ max_negative_polarity        : num  -0.2 -0.1 -0.133 -0.167 -0.05 ...
##  $ title_subjectivity           : num  0.5 0 0 0 0.455 ...
##  $ title_sentiment_polarity     : num  -0.188 0 0 0 0.136 ...
##  $ abs_title_subjectivity       : num  0 0.5 0.5 0.5 0.0455 ...
##  $ abs_title_sentiment_polarity : num  0.188 0 0 0 0.136 ...
##  $ shares                       : int  593 711 1500 1200 505 855 556 891 3600 710 ...

With url removed, we also check whether there are any NULL or NA values in the data; it turns out there are none.

## [1] FALSE
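The FALSE above corresponds to a one-line check along these lines (a sketch; the exact call used is not shown in the report):

```r
# TRUE would indicate at least one NA anywhere in the non-url columns
any(is.na(news[, setdiff(names(news), "url")]))
```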

Cleaning: Handle Categorical Data

Looking at the distributions of the columns above, we can see that columns 14-19 and 32-39 (names listed below) contain binary data, so it is better to convert them to factors.

##  [1] "data_channel_is_lifestyle"     "data_channel_is_entertainment"
##  [3] "data_channel_is_bus"           "data_channel_is_socmed"       
##  [5] "data_channel_is_tech"          "data_channel_is_world"        
##  [7] "weekday_is_monday"             "weekday_is_tuesday"           
##  [9] "weekday_is_wednesday"          "weekday_is_thursday"          
## [11] "weekday_is_friday"             "weekday_is_saturday"          
## [13] "weekday_is_sunday"             "is_weekend"

Converting the columns mentioned above to factors.

## 'data.frame':    39644 obs. of  14 variables:
##  $ data_channel_is_lifestyle    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ data_channel_is_entertainment: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 1 1 1 ...
##  $ data_channel_is_bus          : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
##  $ data_channel_is_socmed       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ data_channel_is_tech         : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 1 2 2 1 ...
##  $ data_channel_is_world        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
##  $ weekday_is_monday            : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ weekday_is_tuesday           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_wednesday         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_thursday          : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_friday            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_saturday          : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ weekday_is_sunday            : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ is_weekend                   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
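One way to perform this conversion in bulk (a sketch; the vector name binary_cols is an assumption):

```r
binary_cols <- c(paste0("data_channel_is_", c("lifestyle", "entertainment",
                                              "bus", "socmed", "tech", "world")),
                 paste0("weekday_is_", c("monday", "tuesday", "wednesday",
                                         "thursday", "friday", "saturday", "sunday")),
                 "is_weekend")
# Convert each 0/1 column to a two-level factor in place
news[binary_cols] <- lapply(news[binary_cols], factor)
str(news[binary_cols])  # each column is now a Factor w/ 2 levels "0","1"
```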

Cleaning: Handle Missing Data

Missing values are coded as 0 in this dataset. Setting aside the binary columns, there are around 1,200 rows with missing data that need to be cleaned first.

Columns with missing data:

## [1] "num_videos"                 "kw_min_min"                
## [3] "LDA_04"                     "global_subjectivity"       
## [5] "global_sentiment_polarity"  "global_rate_negative_words"
## [7] "rate_positive_words"        "rate_negative_words"       
## [9] "max_positive_polarity"

Total records vs no. of unclean records

## [1] "Total Records: 39644 Unclean Records: 1217 i.e. 3.06982141055393 %"

Since the unclean rows make up only about 3% of the data, we can safely omit them.
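The removal step can be sketched as below. Note this is only illustrative: the report does not show the exact rule used to flag the 1,217 unclean rows, and treating every zero in the listed columns as missing would flag more rows than that, so the flagging predicate here is an assumption:

```r
unclean_cols <- c("num_videos", "kw_min_min", "LDA_04", "global_subjectivity",
                  "global_sentiment_polarity", "global_rate_negative_words",
                  "rate_positive_words", "rate_negative_words",
                  "max_positive_polarity")
# Flag rows where any of the affected columns carries a zero-coded missing value
bad <- rowSums(news[unclean_cols] == 0) > 0
sprintf("Total Records: %d Unclean Records: %d i.e. %s %%",
        nrow(news), sum(bad), 100 * sum(bad) / nrow(news))
news <- news[!bad, ]
```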

Transforming: Handle Skewness in Distributions

Some of the variables, including the response shares, have heavily right-skewed distributions, so we transform them to reduce the skewness. For variables whose values are all greater than 0 we use a log transform; for variables containing 0 we use a square-root transform. The response shares is left untransformed here.

Columns undergoing transformation:

##  [1] "n_tokens_title"             "n_non_stop_unique_tokens"  
##  [3] "num_hrefs"                  "num_self_hrefs"            
##  [5] "num_imgs"                   "kw_max_avg"                
##  [7] "kw_avg_avg"                 "self_reference_min_shares" 
##  [9] "self_reference_max_shares"  "LDA_00"                    
## [11] "LDA_01"                     "LDA_02"                    
## [13] "LDA_03"                     "global_rate_positive_words"
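The log/sqrt rule above can be applied mechanically, for example (a sketch; which group each column fell into is decided by the data, not listed in the report):

```r
skewed <- c("n_tokens_title", "n_non_stop_unique_tokens", "num_hrefs",
            "num_self_hrefs", "num_imgs", "kw_max_avg", "kw_avg_avg",
            "self_reference_min_shares", "self_reference_max_shares",
            "LDA_00", "LDA_01", "LDA_02", "LDA_03",
            "global_rate_positive_words")
for (col in skewed) {
  if (all(news[[col]] > 0)) {
    news[[col]] <- log(news[[col]])   # strictly positive: log transform
  } else {
    news[[col]] <- sqrt(news[[col]])  # contains zeros: square-root transform
  }
}
```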

Exploration: Number of shares by the weekday

Observing the pattern of the number of shares on each weekday, we found that the publication day did not have much influence on shares. We created a new categorical column, news_day, from the weekday_is_* columns so that all the days appear in a single column, which made the patterns easier to plot.

So we drop all the weekday indicator columns but keep is_weekend, since there is some difference between weekday and weekend data.
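The news_day column can be derived from the indicators roughly as follows (a sketch; it works whether the weekday_is_* columns are still numeric or already factors):

```r
weekday_cols <- paste0("weekday_is_", c("monday", "tuesday", "wednesday",
                                        "thursday", "friday", "saturday", "sunday"))
# 0/1 matrix with one set indicator per row
mat <- sapply(news[weekday_cols], function(x) as.integer(x == "1"))
# For each row, take the name of the indicator that is set
news$news_day <- sub("weekday_is_", "", weekday_cols[max.col(mat)])
table(news$news_day)
```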

Cleaning: Remove unwanted Columns

Removing the datachannel column created for the plot above.

Also deleting the url and timedelta columns.

Deleting the column n_non_stop_words, since it takes only one value and is therefore constant.

Transformation: Generate binary response variable

We define articles with more than 1400 shares (the median) as popular.
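The recoding is a one-liner (assuming shares still holds the raw counts at this point):

```r
# 1 = popular (above the median of 1400 shares), 0 = not popular
news$shares <- factor(ifelse(news$shares > 1400, 1, 0))
table(news$shares)
```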

Modeling

Since our target is a class (1 for a popular article, 0 for an unpopular one), we will apply classification methods such as discriminant analysis (LDA, QDA), logistic regression, KNN, classification and regression trees (CART), C5.0 trees, and random forests, training on a training set and predicting on a test set. We generate the training and test sets accordingly.
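A reproducible split can be produced as follows. The 70/30 ratio, the seed, and the variable name ind are assumptions, chosen to be consistent with the rpart call later in the report, which subsets the training data with news[ind == 1, ]:

```r
set.seed(123)  # seed value is an assumption, for reproducibility only
# Assign each row to group 1 (train) or 2 (test) with 70/30 probability
ind   <- sample(2, nrow(news), replace = TRUE, prob = c(0.7, 0.3))
train <- news[ind == 1, ]
test  <- news[ind == 2, ]
```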

Check for collinearity

We need to check whether any numeric columns are collinear with each other before applying our algorithms.

##                     clusterV             columns
## rate_positive_words       32 rate_positive_words
## rate_negative_words       32 rate_negative_words

Cluster 32 shows that rate_positive_words and rate_negative_words are collinear, so we remove rate_negative_words.
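One common way to screen for such collinear pairs is a correlation cutoff via caret's findCorrelation (a sketch; the variable-clustering procedure that produced the table above is not shown in the report, and the 0.9 cutoff is an assumption):

```r
library(caret)

num_cols <- sapply(news, is.numeric)
# Names of columns whose pairwise correlation exceeds the cutoff
high_cor <- findCorrelation(cor(news[, num_cols]), cutoff = 0.9, names = TRUE)
high_cor
news$rate_negative_words <- NULL  # drop the collinear column
```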

LDA

Train the model on train set

##             Length Class      Mode     
## prior        2     -none-     numeric  
## counts       2     -none-     numeric  
## means       98     -none-     numeric  
## scaling     49     -none-     numeric  
## lev          2     -none-     character
## svd          1     -none-     numeric  
## N            1     -none-     numeric  
## call         3     -none-     call     
## xNames      49     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    2     -none-     character
## param        0     -none-     list

Predict on test set

Confusion matrix for test set

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3928 2182
##          1 1798 3473
##                                           
##                Accuracy : 0.6503          
##                  95% CI : (0.6415, 0.6591)
##     No Information Rate : 0.5031          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3003          
##  Mcnemar's Test P-Value : 1.271e-09       
##                                           
##             Sensitivity : 0.6860          
##             Specificity : 0.6141          
##          Pos Pred Value : 0.6429          
##          Neg Pred Value : 0.6589          
##              Prevalence : 0.5031          
##          Detection Rate : 0.3451          
##    Detection Prevalence : 0.5369          
##       Balanced Accuracy : 0.6501          
##                                           
##        'Positive' Class : 0               
## 

ROC Curve
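The ROC curves in this and the following sections can be drawn with the pROC package, for example (a sketch assuming a fitted caret model lda_fit and a test set test):

```r
library(pROC)

# Predicted probability of the "popular" class for each test article
probs   <- predict(lda_fit, newdata = test, type = "prob")[, "1"]
roc_obj <- roc(response = test$shares, predictor = probs)
plot(roc_obj, print.auc = TRUE)  # curve with the AUC annotated
```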

QDA

Train the model on train set

##             Length Class      Mode     
## prior          2   -none-     numeric  
## counts         2   -none-     numeric  
## means         98   -none-     numeric  
## scaling     4802   -none-     numeric  
## ldet           2   -none-     numeric  
## lev            2   -none-     character
## N              1   -none-     numeric  
## call           3   -none-     call     
## xNames        49   -none-     character
## problemType    1   -none-     character
## tuneValue      1   data.frame list     
## obsLevels      2   -none-     character
## param          0   -none-     list

Predict on test set

Confusion matrix for test set

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 4157 2582
##          1 1569 3073
##                                           
##                Accuracy : 0.6353          
##                  95% CI : (0.6263, 0.6441)
##     No Information Rate : 0.5031          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2697          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.7260          
##             Specificity : 0.5434          
##          Pos Pred Value : 0.6169          
##          Neg Pred Value : 0.6620          
##              Prevalence : 0.5031          
##          Detection Rate : 0.3653          
##    Detection Prevalence : 0.5921          
##       Balanced Accuracy : 0.6347          
##                                           
##        'Positive' Class : 0               
## 

ROC Curve

Logistic Regression

Train the model on train set

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3201  -1.0224  -0.6162   1.0655   2.6180  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -3.540e+00  6.452e-01  -5.487 4.08e-08 ***
## n_tokens_title                 -1.391e-02  6.447e-02  -0.216 0.829129    
## n_tokens_content                1.708e-04  4.953e-05   3.448 0.000564 ***
## n_unique_tokens                -5.583e-02  3.746e-01  -0.149 0.881539    
## n_non_stop_unique_tokens       -3.908e-01  2.039e-01  -1.917 0.055262 .  
## num_hrefs                       7.959e-02  1.306e-02   6.093 1.11e-09 ***
## num_self_hrefs                 -1.714e-01  1.884e-02  -9.099  < 2e-16 ***
## num_imgs                        3.671e-03  1.269e-02   0.289 0.772317    
## num_videos                     -5.626e-03  3.581e-03  -1.571 0.116138    
## average_token_length           -1.049e-01  5.428e-02  -1.934 0.053164 .  
## num_keywords                    4.150e-02  8.842e-03   4.693 2.69e-06 ***
## data_channel_is_lifestyle1     -1.298e-01  8.648e-02  -1.501 0.133248    
## data_channel_is_entertainment1 -3.745e-01  5.705e-02  -6.564 5.24e-11 ***
## data_channel_is_bus1           -7.764e-02  7.785e-02  -0.997 0.318602    
## data_channel_is_socmed1         7.023e-01  8.363e-02   8.399  < 2e-16 ***
## data_channel_is_tech1           4.876e-01  8.035e-02   6.069 1.29e-09 ***
## data_channel_is_world1         -1.048e-01  7.728e-02  -1.357 0.174888    
## kw_min_min                      1.410e-03  3.719e-04   3.790 0.000150 ***
## kw_max_min                      5.513e-06  1.082e-05   0.510 0.610326    
## kw_avg_min                     -1.322e-04  7.115e-05  -1.858 0.063146 .  
## kw_min_max                     -6.586e-07  2.613e-07  -2.520 0.011731 *  
## kw_max_max                     -5.950e-07  1.340e-07  -4.440 9.02e-06 ***
## kw_avg_max                     -6.041e-07  1.887e-07  -3.202 0.001366 ** 
## kw_min_avg                     -5.739e-05  1.797e-05  -3.194 0.001403 ** 
## kw_max_avg                     -1.278e-02  1.590e-03  -8.038 9.11e-16 ***
## kw_avg_avg                      7.115e-02  4.357e-03  16.332  < 2e-16 ***
## self_reference_min_shares       5.011e-03  4.125e-04  12.150  < 2e-16 ***
## self_reference_max_shares       3.102e-03  3.223e-04   9.624  < 2e-16 ***
## self_reference_avg_sharess     -8.821e-06  9.027e-07  -9.772  < 2e-16 ***
## is_weekend1                     8.444e-01  4.094e-02  20.626  < 2e-16 ***
## LDA_00                          1.390e-01  1.843e-02   7.544 4.57e-14 ***
## LDA_01                         -2.372e-02  1.594e-02  -1.489 0.136602    
## LDA_02                         -3.471e-02  1.771e-02  -1.960 0.049951 *  
## LDA_03                          4.222e-03  1.756e-02   0.241 0.809930    
## LDA_04                          2.349e-01  1.023e-01   2.297 0.021633 *  
## global_subjectivity             1.089e+00  1.919e-01   5.677 1.37e-08 ***
## global_sentiment_polarity      -7.936e-02  3.737e-01  -0.212 0.831829    
## global_rate_positive_words     -4.160e-02  7.552e-02  -0.551 0.581741    
## global_rate_negative_words      3.547e+00  3.745e+00   0.947 0.343546    
## rate_positive_words             4.127e-01  3.127e-01   1.320 0.186852    
## avg_positive_polarity          -4.401e-01  3.074e-01  -1.432 0.152217    
## min_positive_polarity          -4.086e-01  2.552e-01  -1.601 0.109382    
## max_positive_polarity           9.209e-03  9.681e-02   0.095 0.924210    
## avg_negative_polarity          -4.970e-02  2.821e-01  -0.176 0.860152    
## min_negative_polarity           7.079e-02  1.036e-01   0.683 0.494327    
## max_negative_polarity           1.395e-01  2.337e-01   0.597 0.550546    
## title_subjectivity              9.030e-02  6.177e-02   1.462 0.143765    
## title_sentiment_polarity        1.567e-01  5.726e-02   2.736 0.006220 ** 
## abs_title_subjectivity          2.970e-01  8.260e-02   3.596 0.000324 ***
## abs_title_sentiment_polarity    1.814e-02  9.001e-02   0.201 0.840309    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37482  on 27045  degrees of freedom
## Residual deviance: 33722  on 26996  degrees of freedom
## AIC: 33822
## 
## Number of Fisher Scoring iterations: 4

Predict on test set

Confusion matrix for test set

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3926 2172
##          1 1800 3483
##                                           
##                Accuracy : 0.651           
##                  95% CI : (0.6422, 0.6598)
##     No Information Rate : 0.5031          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3017          
##  Mcnemar's Test P-Value : 3.941e-09       
##                                           
##             Sensitivity : 0.6856          
##             Specificity : 0.6159          
##          Pos Pred Value : 0.6438          
##          Neg Pred Value : 0.6593          
##              Prevalence : 0.5031          
##          Detection Rate : 0.3450          
##    Detection Prevalence : 0.5358          
##       Balanced Accuracy : 0.6508          
##                                           
##        'Positive' Class : 0               
## 

ROC Curve

KNN

Train the model on train set

##         Length Class  Mode   
## learn   2      -none- list   
## k       1      -none- numeric
## terms   3      terms  call   
## xlevels 7      -none- list   
## theDots 0      -none- list

Predict on test set

Confusion matrix for test set

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3321 2723
##          1 2405 2932
##                                           
##                Accuracy : 0.5494          
##                  95% CI : (0.5402, 0.5586)
##     No Information Rate : 0.5031          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.0985          
##  Mcnemar's Test P-Value : 9.566e-06       
##                                           
##             Sensitivity : 0.5800          
##             Specificity : 0.5185          
##          Pos Pred Value : 0.5495          
##          Neg Pred Value : 0.5494          
##              Prevalence : 0.5031          
##          Detection Rate : 0.2918          
##    Detection Prevalence : 0.5311          
##       Balanced Accuracy : 0.5492          
##                                           
##        'Positive' Class : 0               
## 

ROC Curve

CART

Train the model on train set

## Call:
## rpart(formula = shares ~ ., data = news[ind == 1, ], method = "class")
##   n= 27046 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.18105597      0 1.0000000 1.0000000 0.006209699
## 2 0.02711685      1 0.8189440 0.8202281 0.006089356
## 3 0.01329406      2 0.7918272 0.7923559 0.006052834
## 4 0.01000000      4 0.7652391 0.7681849 0.006017118
## 
## Variable importance
##                    kw_avg_avg                    kw_max_avg 
##                            24                            15 
##                    kw_min_avg                    kw_min_max 
##                            10                             9 
##                        LDA_03                    kw_avg_max 
##                             8                             8 
## data_channel_is_entertainment          data_channel_is_tech 
##                             7                             6 
##        data_channel_is_socmed                        LDA_04 
##                             6                             4 
##                        LDA_01 
##                             2 
## 
## Node number 1: 27046 observations,    complexity param=0.181056
##   predicted class=0  expected loss=0.4894994  P(node) =1
##     class counts: 13807 13239
##    probabilities: 0.511 0.489 
##   left son=2 (13891 obs) right son=3 (13155 obs)
##   Primary splits:
##       kw_avg_avg                 < 53.68422   to the left,  improve=528.8528, (0 missing)
##       kw_max_avg                 < 61.3408    to the left,  improve=448.9506, (0 missing)
##       self_reference_min_shares  < 40.61553   to the left,  improve=418.7020, (0 missing)
##       self_reference_avg_sharess < 1896.167   to the left,  improve=397.4711, (0 missing)
##       LDA_02                     < -0.6717228 to the right, improve=343.2252, (0 missing)
##   Surrogate splits:
##       kw_max_avg < 65.87515   to the left,  agree=0.821, adj=0.631, (0 split)
##       kw_min_avg < 1692.644   to the left,  agree=0.725, adj=0.435, (0 split)
##       kw_min_max < 2950       to the left,  agree=0.696, adj=0.375, (0 split)
##       LDA_03     < -2.995495  to the left,  agree=0.685, adj=0.352, (0 split)
##       kw_avg_max < 283366.9   to the left,  agree=0.672, adj=0.325, (0 split)
## 
## Node number 2: 13891 observations,    complexity param=0.01329406
##   predicted class=0  expected loss=0.3932762  P(node) =0.5136064
##     class counts:  8428  5463
##    probabilities: 0.607 0.393 
##   left son=4 (10673 obs) right son=5 (3218 obs)
##   Primary splits:
##       data_channel_is_tech       splits as  LR, improve=140.9526, (0 missing)
##       self_reference_avg_sharess < 1866.833   to the left,  improve=129.0816, (0 missing)
##       kw_avg_max                 < 146818.8   to the right, improve=126.8164, (0 missing)
##       is_weekend                 splits as  LR, improve=126.3479, (0 missing)
##       self_reference_min_shares  < 39.36492   to the left,  improve=123.2359, (0 missing)
##   Surrogate splits:
##       LDA_04               < 0.5105669  to the left,  agree=0.898, adj=0.558, (0 split)
##       num_self_hrefs       < 3.239451   to the left,  agree=0.772, adj=0.016, (0 split)
##       average_token_length < 4.148773   to the right, agree=0.771, adj=0.013, (0 split)
##       n_unique_tokens      < 0.3240034  to the right, agree=0.769, adj=0.001, (0 split)
##       LDA_03               < -4.003425  to the right, agree=0.769, adj=0.001, (0 split)
## 
## Node number 3: 13155 observations,    complexity param=0.02711685
##   predicted class=1  expected loss=0.408894  P(node) =0.4863936
##     class counts:  5379  7776
##    probabilities: 0.409 0.591 
##   left son=6 (2611 obs) right son=7 (10544 obs)
##   Primary splits:
##       data_channel_is_entertainment splits as  RL, improve=166.48210, (0 missing)
##       self_reference_min_shares     < 40.61553   to the left,  improve=141.94850, (0 missing)
##       self_reference_avg_sharess    < 2974.167   to the left,  improve=117.24110, (0 missing)
##       is_weekend                    splits as  LR, improve= 93.10397, (0 missing)
##       self_reference_max_shares     < 55.22495   to the left,  improve= 92.74248, (0 missing)
##   Surrogate splits:
##       LDA_01                   < -0.7274334 to the right, agree=0.864, adj=0.316, (0 split)
##       num_videos               < 21.5       to the right, agree=0.805, adj=0.018, (0 split)
##       num_imgs                 < 7.035534   to the right, agree=0.805, adj=0.015, (0 split)
##       n_non_stop_unique_tokens < -1.319281  to the left,  agree=0.802, adj=0.003, (0 split)
##       average_token_length     < 3.803062   to the left,  agree=0.802, adj=0.002, (0 split)
## 
## Node number 4: 10673 observations,    complexity param=0.01329406
##   predicted class=0  expected loss=0.3541647  P(node) =0.394624
##     class counts:  6893  3780
##    probabilities: 0.646 0.354 
##   left son=8 (10107 obs) right son=9 (566 obs)
##   Primary splits:
##       data_channel_is_socmed    splits as  LR, improve=127.07840, (0 missing)
##       kw_avg_max                < 143976.2   to the right, improve=116.11490, (0 missing)
##       kw_max_max                < 654150     to the right, improve=107.19990, (0 missing)
##       self_reference_min_shares < 40.61553   to the left,  improve=102.78630, (0 missing)
##       kw_min_min                < 122.5      to the left,  improve= 95.03958, (0 missing)
##   Surrogate splits:
##       num_self_hrefs < 6.36384    to the left,  agree=0.947, adj=0.009, (0 split)
##       num_keywords   < 2.5        to the right, agree=0.947, adj=0.009, (0 split)
## 
## Node number 5: 3218 observations
##   predicted class=1  expected loss=0.4770044  P(node) =0.1189825
##     class counts:  1535  1683
##    probabilities: 0.477 0.523 
## 
## Node number 6: 2611 observations
##   predicted class=0  expected loss=0.4312524  P(node) =0.09653923
##     class counts:  1485  1126
##    probabilities: 0.569 0.431 
## 
## Node number 7: 10544 observations
##   predicted class=1  expected loss=0.3693096  P(node) =0.3898543
##     class counts:  3894  6650
##    probabilities: 0.369 0.631 
## 
## Node number 8: 10107 observations
##   predicted class=0  expected loss=0.3359058  P(node) =0.3736967
##     class counts:  6712  3395
##    probabilities: 0.664 0.336 
## 
## Node number 9: 566 observations
##   predicted class=1  expected loss=0.319788  P(node) =0.02092731
##     class counts:   181   385
##    probabilities: 0.320 0.680

Plot the tree
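A minimal sketch of how the fitted tree could be visualized with `rpart.plot`; the model object name `fit` is an assumption, not taken from the source:

```r
# Plot the fitted rpart tree; 'fit' is the rpart object (name assumed)
library(rpart.plot)
rpart.plot(fit,
           type = 2,            # label all nodes with the split variable
           extra = 104,         # show predicted class, probabilities, and node %
           box.palette = "GnRd",
           main = "Decision Tree for Article Popularity")
```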

Predict on the test set

Confusion matrix for the test set
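A sketch of the prediction and evaluation step. The training call in the C5.0 output uses `news[ind == 1, ]`, so the test split is assumed to be `news[ind == 2, ]`; the object name `fit` is likewise an assumption:

```r
# Predict classes on the held-out split and tabulate against the truth;
# 'fit' and the ind == 2 test split are assumed, mirroring the training call
library(caret)
pred <- predict(fit, newdata = news[ind == 2, ], type = "class")
confusionMatrix(pred, news[ind == 2, ]$shares)
```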

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3414 1983
##          1 2312 3672
##                                           
##                Accuracy : 0.6226          
##                  95% CI : (0.6136, 0.6315)
##     No Information Rate : 0.5031          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2455          
##  Mcnemar's Test P-Value : 5.59e-07        
##                                           
##             Sensitivity : 0.5962          
##             Specificity : 0.6493          
##          Pos Pred Value : 0.6326          
##          Neg Pred Value : 0.6136          
##              Prevalence : 0.5031          
##          Detection Rate : 0.3000          
##    Detection Prevalence : 0.4742          
##       Balanced Accuracy : 0.6228          
##                                           
##        'Positive' Class : 0               
## 

ROC Curve
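The ROC curve can be drawn from the predicted class-1 probabilities, for example with `pROC`; object names and the test split are assumptions:

```r
# ROC curve from predicted probabilities of the positive class (pROC)
library(pROC)
prob    <- predict(fit, newdata = news[ind == 2, ], type = "prob")[, 2]
roc_obj <- roc(response = news[ind == 2, ]$shares, predictor = prob)
plot(roc_obj, print.auc = TRUE)  # annotate the plot with the AUC
```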

C5.0

Train the model on the training set
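A minimal sketch of the training step, consistent with the `Call:` echoed in the output that follows; the object name `c50_fit` is an assumption:

```r
# Fit a C5.0 decision tree on the training split
library(C50)
c50_fit <- C5.0(shares ~ ., data = news[ind == 1, ], method = "class")
summary(c50_fit)  # prints the full decision tree, subtrees, and attribute usage
```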

## 
## Call:
## C5.0.formula(formula = shares ~ ., data = news[ind == 1, ], method
##  = "class")
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Sun Mar 24 04:14:21 2019
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 27046 cases (50 attributes) from undefined.data
## 
## Decision tree:
## 
## kw_avg_avg <= 53.68346:
## :...data_channel_is_socmed = 1:
## :   :...kw_min_max > 3200: 0 (46/15)
## :   :   kw_min_max <= 3200:
## :   :   :...min_positive_polarity <= 0.03333334: 1 (272/53)
## :   :       min_positive_polarity > 0.03333334:
## :   :       :...self_reference_min_shares > 60.82763: 1 (49/7)
## :   :           self_reference_min_shares <= 60.82763:
## :   :           :...avg_negative_polarity <= -0.4787698: 0 (12/1)
## :   :               avg_negative_polarity > -0.4787698:
## :   :               :...LDA_00 <= -2.992677:
## :   :                   :...global_rate_negative_words > 0.02540416: 0 (11)
## :   :                   :   global_rate_negative_words <= 0.02540416:
## :   :                   :   :...n_tokens_title <= 2.302585: 1 (28/11)
## :   :                   :       n_tokens_title > 2.302585: 0 (25/6)
## :   :                   LDA_00 > -2.992677:
## :   :                   :...rate_positive_words > 0.9152542: 1 (16)
## :   :                       rate_positive_words <= 0.9152542:
## :   :                       :...global_sentiment_polarity <= 0.2488831: 1 (98/30)
## :   :                           global_sentiment_polarity > 0.2488831: 0 (9/1)
## :   data_channel_is_socmed = 0:
## :   :...is_weekend = 1:
## :       :...data_channel_is_tech = 1: 1 (364/86)
## :       :   data_channel_is_tech = 0:
## :       :   :...kw_min_min <= 129:
## :       :       :...LDA_00 > -1.733872: 1 (196/64)
## :       :       :   LDA_00 <= -1.733872:
## :       :       :   :...kw_max_max <= 690400:
## :       :       :       :...title_sentiment_polarity <= -0.15: 0 (6)
## :       :       :       :   title_sentiment_polarity > -0.15: 1 (52/16)
## :       :       :       kw_max_max > 690400:
## :       :       :       :...num_hrefs > 1.414214: 0 (593/253)
## :       :       :           num_hrefs <= 1.414214:
## :       :       :           :...kw_avg_avg <= 53.26955: 0 (53/7)
## :       :       :               kw_avg_avg > 53.26955: 1 (4)
## :       :       kw_min_min > 129:
## :       :       :...abs_title_sentiment_polarity > 0.5125: 1 (19)
## :       :           abs_title_sentiment_polarity <= 0.5125:
## :       :           :...data_channel_is_entertainment = 0: 1 (120/35)
## :       :               data_channel_is_entertainment = 1:
## :       :               :...LDA_01 <= -3.331762: 0 (9)
## :       :                   LDA_01 > -3.331762:
## :       :                   :...num_videos > 1: 0 (3)
## :       :                       num_videos <= 1:
## :       :                       :...min_positive_polarity > 0.0625: 1 (13)
## :       :                           min_positive_polarity <= 0.0625:
## :       :                           :...num_keywords <= 9: 1 (6/1)
## :       :                               num_keywords > 9: 0 (5)
## :       is_weekend = 0:
## :       :...data_channel_is_tech = 1:
## :           :...n_unique_tokens <= 0.3858625: 1 (237/78)
## :           :   n_unique_tokens > 0.3858625:
## :           :   :...kw_min_min > 88: 1 (462/187)
## :           :       kw_min_min <= 88:
## :           :       :...n_non_stop_unique_tokens > -0.1306202:
## :           :           :...num_hrefs <= 1.414214: 1 (5)
## :           :           :   num_hrefs > 1.414214: 0 (64/9)
## :           :           n_non_stop_unique_tokens <= -0.1306202:
## :           :           :...self_reference_avg_sharess <= 1788.667:
## :           :               :...num_keywords <= 8:
## :           :               :   :...global_rate_negative_words <= 0.02448709: 0 (349/132)
## :           :               :   :   global_rate_negative_words > 0.02448709: 1 (57/21)
## :           :               :   num_keywords > 8:
## :           :               :   :...kw_min_avg > 1668.8:
## :           :               :       :...kw_min_max <= 12700: 1 (14/3)
## :           :               :       :   kw_min_max > 12700: 0 (2)
## :           :               :       kw_min_avg <= 1668.8:
## :           :               :       :...max_positive_polarity <= 0.75: 0 (104/14)
## :           :               :           max_positive_polarity > 0.75:
## :           :               :           :...avg_negative_polarity <= -0.3720588: 1 (8/1)
## :           :               :               avg_negative_polarity > -0.3720588: 0 (142/45)
## :           :               self_reference_avg_sharess > 1788.667:
## :           :               :...kw_avg_avg <= 50.30047:
## :           :                   :...global_rate_positive_words <= -3.226256:
## :           :                   :   :...num_imgs <= 0: 1 (46/14)
## :           :                   :   :   num_imgs > 0: 0 (295/146)
## :           :                   :   global_rate_positive_words > -3.226256:
## :           :                   :   :...title_sentiment_polarity <= 0.2590909: 0 (331/112)
## :           :                   :       title_sentiment_polarity > 0.2590909:
## :           :                   :       :...kw_max_avg <= 57.11774: 0 (7)
## :           :                   :           kw_max_avg > 57.11774: 1 (66/27)
## :           :                   kw_avg_avg > 50.30047:
## :           :                   :...kw_min_min <= 0: 1 (399/152)
## :           :                       kw_min_min > 0:
## :           :                       :...n_unique_tokens <= 0.5252708:
## :           :                           :...avg_positive_polarity > 0.4522186: 0 (6)
## :           :                           :   avg_positive_polarity <= 0.4522186:
## :           :                           :   :...num_keywords > 5: 1 (90/21)
## :           :                           :       num_keywords <= 5: [S1]
## :           :                           n_unique_tokens > 0.5252708:
## :           :                           :...num_imgs > 1.414214: 0 (10)
## :           :                               num_imgs <= 1.414214: [S2]
## :           data_channel_is_tech = 0:
## :           :...kw_max_max <= 617900:
## :               :...global_subjectivity <= 0.3330598: 0 (155/41)
## :               :   global_subjectivity > 0.3330598:
## :               :   :...n_tokens_title <= 1.94591:
## :               :       :...kw_min_avg <= 383: 1 (122/42)
## :               :       :   kw_min_avg > 383:
## :               :       :   :...num_keywords <= 4:
## :               :       :       :...num_hrefs <= 2.645751: 1 (10)
## :               :       :       :   num_hrefs > 2.645751: 0 (4/1)
## :               :       :       num_keywords > 4:
## :               :       :       :...kw_max_min <= 1400: 0 (18)
## :               :       :           kw_max_min > 1400:
## :               :       :           :...num_self_hrefs <= 1: 0 (3)
## :               :       :               num_self_hrefs > 1: 1 (4)
## :               :       n_tokens_title > 1.94591:
## :               :       :...num_keywords <= 4: 0 (115/35)
## :               :           num_keywords > 4:
## :               :           :...self_reference_avg_sharess > 1991.4:
## :               :               :...data_channel_is_entertainment = 0: 1 (382/163)
## :               :               :   data_channel_is_entertainment = 1:
## :               :               :   :...n_tokens_title > 2.484907: 1 (10/1)
## :               :               :       n_tokens_title <= 2.484907:
## :               :               :       :...n_tokens_title <= 2.397895: 0 (78/31)
## :               :               :           n_tokens_title > 2.397895:
## :               :               :           :...num_hrefs <= 2.828427: 0 (4)
## :               :               :               num_hrefs > 2.828427: 1 (6)
## :               :               self_reference_avg_sharess <= 1991.4:
## :               :               :...num_self_hrefs > 2:
## :               :                   :...self_reference_max_shares <= 31: 1 (12/3)
## :               :                   :   self_reference_max_shares > 31:
## :               :                   :   :...num_videos <= 1: 0 (60/8)
## :               :                   :       num_videos > 1: [S3]
## :               :                   num_self_hrefs <= 2:
## :               :                   :...n_tokens_title > 2.564949:
## :               :                       :...num_self_hrefs <= 1.414214: 1 (18/2)
## :               :                       :   num_self_hrefs > 1.414214: [S4]
## :               :                       n_tokens_title <= 2.564949:
## :               :                       :...min_negative_polarity > -1: 0 (646/264)
## :               :                           min_negative_polarity <= -1: [S5]
## :               kw_max_max > 617900:
## :               :...data_channel_is_lifestyle = 1:
## :                   :...kw_max_max <= 690400: 1 (72/29)
## :                   :   kw_max_max > 690400: 0 (160/65)
## :                   data_channel_is_lifestyle = 0:
## :                   :...self_reference_avg_sharess > 3185.5:
## :                       :...kw_max_avg > 62.44197:
## :                       :   :...LDA_00 > -0.1674443: 1 (43/9)
## :                       :   :   LDA_00 <= -0.1674443:
## :                       :   :   :...n_tokens_title > 2.639057: 1 (25/6)
## :                       :   :       n_tokens_title <= 2.639057:
## :                       :   :       :...num_hrefs <= 3.162278: 0 (373/140)
## :                       :   :           num_hrefs > 3.162278: [S6]
## :                       :   kw_max_avg <= 62.44197:
## :                       :   :...LDA_04 <= 0.02857156: 0 (134/23)
## :                       :       LDA_04 > 0.02857156:
## :                       :       :...data_channel_is_entertainment = 0:
## :                       :           :...global_subjectivity <= 0.4697912:
## :                       :           :   :...n_unique_tokens <= 0.3952226: 1 (25/7)
## :                       :           :   :   n_unique_tokens > 0.3952226: 0 (575/186)
## :                       :           :   global_subjectivity > 0.4697912:
## :                       :           :   :...max_negative_polarity <= -0.375: 0 (11/1)
## :                       :           :       max_negative_polarity > -0.375: [S7]
## :                       :           data_channel_is_entertainment = 1:
## :                       :           :...kw_avg_min > 247.25: 1 (54/25)
## :                       :               kw_avg_min <= 247.25:
## :                       :               :...LDA_00 > -3.555154: 0 (152/25)
## :                       :                   LDA_00 <= -3.555154:
## :                       :                   :...LDA_00 <= -3.555281: 0 (18/6)
## :                       :                       LDA_00 > -3.555281: 1 (8)
## :                       self_reference_avg_sharess <= 3185.5:
## :                       :...num_self_hrefs > 3.741657:
## :                           :...max_positive_polarity <= 0.9: 0 (5)
## :                           :   max_positive_polarity > 0.9: 1 (21/6)
## :                           num_self_hrefs <= 3.741657:
## :                           :...kw_max_avg <= 60.29007: 0 (3206/630)
## :                               kw_max_avg > 60.29007:
## :                               :...data_channel_is_entertainment = 1: 0 (564/128)
## :                                   data_channel_is_entertainment = 0:
## :                                   :...num_hrefs > 5.09902: 1 (45/18)
## :                                       num_hrefs <= 5.09902:
## :                                       :...LDA_02 > -0.3257387: 0 (396/85)
## :                                           LDA_02 <= -0.3257387: [S8]
## kw_avg_avg > 53.68346:
## :...data_channel_is_entertainment = 1:
##     :...is_weekend = 1:
##     :   :...global_subjectivity > 0.3559193:
##     :   :   :...kw_max_avg <= 60.61333: 0 (23/9)
##     :   :   :   kw_max_avg > 60.61333: 1 (318/106)
##     :   :   global_subjectivity <= 0.3559193:
##     :   :   :...num_hrefs <= 2.44949: 0 (24/8)
##     :   :       num_hrefs > 2.44949:
##     :   :       :...max_negative_polarity <= -0.075: 0 (10/2)
##     :   :           max_negative_polarity > -0.075: 1 (7/1)
##     :   is_weekend = 0:
##     :   :...kw_avg_avg <= 64.43067:
##     :       :...kw_max_max <= 617900:
##     :       :   :...global_rate_positive_words <= -3.676709: 1 (9)
##     :       :   :   global_rate_positive_words > -3.676709:
##     :       :   :   :...kw_min_min <= 88: 1 (27/11)
##     :       :   :       kw_min_min > 88: 0 (68/27)
##     :       :   kw_max_max > 617900:
##     :       :   :...num_keywords > 7:
##     :       :       :...min_negative_polarity <= -0.875: 1 (118/54)
##     :       :       :   min_negative_polarity > -0.875: 0 (431/166)
##     :       :       num_keywords <= 7:
##     :       :       :...self_reference_avg_sharess <= 3960: 0 (755/209)
##     :       :           self_reference_avg_sharess > 3960:
##     :       :           :...LDA_02 > -1.12707: 0 (28/3)
##     :       :               LDA_02 <= -1.12707:
##     :       :               :...num_self_hrefs <= 2.645751: 0 (310/127)
##     :       :                   num_self_hrefs > 2.645751: 1 (26/8)
##     :       kw_avg_avg > 64.43067:
##     :       :...max_positive_polarity <= 0.55: 0 (72/27)
##     :           max_positive_polarity > 0.55:
##     :           :...num_imgs > 1: 1 (203/70)
##     :               num_imgs <= 1:
##     :               :...abs_title_subjectivity <= 0.1375:
##     :                   :...avg_positive_polarity <= 0.485: 0 (38/7)
##     :                   :   avg_positive_polarity > 0.485: 1 (4)
##     :                   abs_title_subjectivity > 0.1375:
##     :                   :...kw_min_max > 84800:
##     :                       :...title_subjectivity <= 0.5833333: 0 (7/1)
##     :                       :   title_subjectivity > 0.5833333: 1 (2)
##     :                       kw_min_max <= 84800:
##     :                       :...n_tokens_title > 2.484907: 1 (32/6)
##     :                           n_tokens_title <= 2.484907:
##     :                           :...n_unique_tokens <= 0.4723404: 1 (18/2)
##     :                               n_unique_tokens > 0.4723404: 0 (81/34)
##     data_channel_is_entertainment = 0:
##     :...kw_min_max > 690400: 0 (34/7)
##         kw_min_max <= 690400:
##         :...LDA_02 > -0.8554933:
##             :...is_weekend = 1:
##             :   :...self_reference_avg_sharess <= 1242: 0 (43/18)
##             :   :   self_reference_avg_sharess > 1242:
##             :   :   :...num_self_hrefs <= 3.872983: 1 (100/20)
##             :   :       num_self_hrefs > 3.872983: 0 (5/1)
##             :   is_weekend = 0:
##             :   :...LDA_00 > -1.392776: 1 (91/27)
##             :       LDA_00 <= -1.392776:
##             :       :...min_positive_polarity <= 0.03333334:
##             :           :...kw_min_min > 88: 0 (8)
##             :           :   kw_min_min <= 88:
##             :           :   :...self_reference_min_shares > 51.96152:
##             :           :       :...kw_avg_min <= 1788.125: 1 (35/3)
##             :           :       :   kw_avg_min > 1788.125: 0 (2)
##             :           :       self_reference_min_shares <= 51.96152:
##             :           :       :...kw_avg_max > 397300: 1 (7)
##             :           :           kw_avg_max <= 397300:
##             :           :           :...kw_min_max <= 38900: 1 (94/45)
##             :           :               kw_min_max > 38900: 0 (7)
##             :           min_positive_polarity > 0.03333334:
##             :           :...self_reference_max_shares <= 40: 0 (258/67)
##             :               self_reference_max_shares > 40:
##             :               :...num_keywords <= 4:
##             :                   :...min_positive_polarity > 0.05: 0 (21/1)
##             :                   :   min_positive_polarity <= 0.05:
##             :                   :   :...max_positive_polarity <= 0.75: 0 (3)
##             :                   :       max_positive_polarity > 0.75: 1 (4)
##             :                   num_keywords > 4:
##             :                   :...kw_min_avg > 2151.778: 1 (98/35)
##             :                       kw_min_avg <= 2151.778:
##             :                       :...kw_max_max <= 617900:
##             :                           :...kw_max_min <= 531: 0 (3)
##             :                           :   kw_max_min > 531: 1 (14/1)
##             :                           kw_max_max > 617900:
##             :                           :...num_keywords <= 9: 0 (227/77)
##             :                               num_keywords > 9: 1 (71/33)
##             LDA_02 <= -0.8554933:
##             :...is_weekend = 1: 1 (1459/338)
##                 is_weekend = 0:
##                 :...n_unique_tokens <= 0.4378172:
##                     :...data_channel_is_lifestyle = 0:
##                     :   :...min_positive_polarity <= 0.05: 1 (441/67)
##                     :   :   min_positive_polarity > 0.05:
##                     :   :   :...abs_title_subjectivity > 0.04666667: 1 (298/74)
##                     :   :       abs_title_subjectivity <= 0.04666667:
##                     :   :       :...data_channel_is_socmed = 0: 0 (40/16)
##                     :   :           data_channel_is_socmed = 1: 1 (4)
##                     :   data_channel_is_lifestyle = 1:
##                     :   :...kw_min_avg <= 1116.552: 0 (61/29)
##                     :       kw_min_avg > 1116.552:
##                     :       :...kw_min_max <= 17000: 1 (47/4)
##                     :           kw_min_max > 17000:
##                     :           :...kw_max_max <= 690400: 1 (3)
##                     :               kw_max_max > 690400: 0 (20/8)
##                     n_unique_tokens > 0.4378172:
##                     :...self_reference_min_shares <= 45.82576:
##                         :...kw_min_min > 4: 1 (300/92)
##                         :   kw_min_min <= 4:
##                         :   :...num_self_hrefs > 3: 1 (147/40)
##                         :       num_self_hrefs <= 3:
##                         :       :...data_channel_is_socmed = 1:
##                         :           :...min_positive_polarity <= 0.03333334:
##                         :           :   :...n_tokens_content <= 890: 1 (109/18)
##                         :           :   :   n_tokens_content > 890: 0 (8/1)
##                         :           :   min_positive_polarity > 0.03333334:
##                         :           :   :...max_positive_polarity > 0.85:
##                         :           :       :...kw_min_min <= 0: 0 (46/13)
##                         :           :       :   kw_min_min > 0: 1 (25/11)
##                         :           :       max_positive_polarity <= 0.85:
##                         :           :       :...n_tokens_title > 1.791759: [S9]
##                         :           :           n_tokens_title <= 1.791759: [S10]
##                         :           data_channel_is_socmed = 0:
##                         :           :...num_keywords <= 3:
##                         :               :...kw_max_max > 690400: 0 (68/16)
##                         :               :   kw_max_max <= 690400:
##                         :               :   :...num_imgs <= 0: 1 (5)
##                         :               :       num_imgs > 0: 0 (18/6)
##                         :               num_keywords > 3:
##                         :               :...num_imgs > 1:
##                         :                   :...num_keywords <= 4: 0 (48/18)
##                         :                   :   num_keywords > 4: [S11]
##                         :                   num_imgs <= 1:
##                         :                   :...n_tokens_content > 654: 1 (296/113)
##                         :                       n_tokens_content <= 654: [S12]
##                         self_reference_min_shares > 45.82576:
##                         :...n_tokens_content <= 87:
##                             :...n_tokens_title <= 1.94591: 0 (15)
##                             :   n_tokens_title > 1.94591: 1 (32/15)
##                             n_tokens_content > 87:
##                             :...data_channel_is_socmed = 1:
##                                 :...num_videos > 2:
##                                 :   :...num_videos > 7: 1 (12)
##                                 :   :   num_videos <= 7:
##                                 :   :   :...n_unique_tokens <= 0.5530547: 1 (9/2)
##                                 :   :       n_unique_tokens > 0.5530547: 0 (10)
##                                 :   num_videos <= 2:
##                                 :   :...min_positive_polarity > 0.16: [S13]
##                                 :       min_positive_polarity <= 0.16: [S14]
##                                 data_channel_is_socmed = 0:
##                                 :...self_reference_min_shares <= 86.60254:
##                                     :...data_channel_is_tech = 0:
##                                     :   :...kw_avg_avg > 69.36768: 1 (394/108)
##                                     :   :   kw_avg_avg <= 69.36768: [S15]
##                                     :   data_channel_is_tech = 1:
##                                     :   :...num_videos > 1: 1 (31/2)
##                                     :       num_videos <= 1:
##                                     :       :...n_non_stop_unique_tokens > -0.1763514: [S16]
##                                     :           n_non_stop_unique_tokens <= -0.1763514:
##                                     :           :...num_hrefs > 2: 1 (253/59)
##                                     :               num_hrefs <= 2:
##                                     :               :...num_imgs > 2.828427: 0 (8/2)
##                                     :                   num_imgs <= 2.828427: [S17]
##                                     self_reference_min_shares > 86.60254:
##                                     :...average_token_length <= 4.220472:
##                                         :...avg_positive_polarity <= 0.4258903: 1 (24/9)
##                                         :   avg_positive_polarity > 0.4258903: 0 (14)
##                                         average_token_length > 4.220472: [S18]
## 
## SubTree [S1]
## 
## self_reference_min_shares <= 64.03124: 0 (7)
## self_reference_min_shares > 64.03124: 1 (3)
## 
## SubTree [S2]
## 
## self_reference_max_shares > 152.3155: 1 (10)
## self_reference_max_shares <= 152.3155:
## :...kw_max_max > 690400: 0 (92/26)
##     kw_max_max <= 690400:
##     :...average_token_length <= 4.915152: 1 (42/14)
##         average_token_length > 4.915152: 0 (6)
## 
## SubTree [S3]
## 
## max_negative_polarity <= -0.15: 0 (3)
## max_negative_polarity > -0.15: 1 (5)
## 
## SubTree [S4]
## 
## min_positive_polarity <= 0.0625: 1 (2)
## min_positive_polarity > 0.0625: 0 (5)
## 
## SubTree [S5]
## 
## min_positive_polarity > 0.1: 1 (7)
## min_positive_polarity <= 0.1:
## :...num_imgs <= 0: 0 (9/1)
##     num_imgs > 0: 1 (56/22)
## 
## SubTree [S6]
## 
## data_channel_is_entertainment = 0: 1 (124/53)
## data_channel_is_entertainment = 1: 0 (97/42)
## 
## SubTree [S7]
## 
## abs_title_subjectivity <= 0.475: 0 (85/34)
## abs_title_subjectivity > 0.475: 1 (93/31)
## 
## SubTree [S8]
## 
## n_non_stop_unique_tokens <= -0.5900176: 1 (40/13)
## n_non_stop_unique_tokens > -0.5900176:
## :...num_videos <= 0: 0 (731/237)
##     num_videos > 0:
##     :...kw_max_avg <= 67.2759:
##         :...num_imgs <= 1.732051: 0 (94/24)
##         :   num_imgs > 1.732051:
##         :   :...max_positive_polarity <= 0.7: 0 (2)
##         :       max_positive_polarity > 0.7: 1 (5)
##         kw_max_avg > 67.2759:
##         :...avg_positive_polarity <= 0.2941799: 0 (37/10)
##             avg_positive_polarity > 0.2941799:
##             :...num_self_hrefs <= 1.732051: 1 (87/25)
##                 num_self_hrefs > 1.732051: 0 (12/3)
## 
## SubTree [S9]
## 
## global_subjectivity <= 0.6135198: 1 (84/18)
## global_subjectivity > 0.6135198: 0 (13/3)
## 
## SubTree [S10]
## 
## min_positive_polarity > 0.375: 1 (2)
## min_positive_polarity <= 0.375:
## :...LDA_04 <= 0.04000269: 1 (2)
##     LDA_04 > 0.04000269: 0 (11)
## 
## SubTree [S11]
## 
## self_reference_min_shares > 31.62278: 1 (705/251)
## self_reference_min_shares <= 31.62278:
## :...data_channel_is_lifestyle = 0: 1 (400/181)
##     data_channel_is_lifestyle = 1: 0 (41/16)
## 
## SubTree [S12]
## 
## data_channel_is_tech = 0: 0 (1158/504)
## data_channel_is_tech = 1:
## :...num_keywords > 7:
##     :...n_tokens_content <= 208: 0 (44/8)
##     :   n_tokens_content > 208: 1 (129/61)
##     num_keywords <= 7:
##     :...n_tokens_title > 2.079442: 1 (133/43)
##         n_tokens_title <= 2.079442:
##         :...n_tokens_title <= 1.791759: 1 (5)
##             n_tokens_title > 1.791759: 0 (39/13)
## 
## SubTree [S13]
## 
## self_reference_max_shares <= 96.43651: 0 (6)
## self_reference_max_shares > 96.43651: 1 (7)
## 
## SubTree [S14]
## 
## self_reference_max_shares > 66.3325: 1 (174/11)
## self_reference_max_shares <= 66.3325:
## :...n_tokens_content <= 150: 0 (5)
##     n_tokens_content > 150:
##     :...global_rate_negative_words <= 0.02584814: 1 (47/6)
##         global_rate_negative_words > 0.02584814: 0 (9/3)
## 
## SubTree [S15]
## 
## max_negative_polarity <= -0.375: 0 (50/19)
## max_negative_polarity > -0.375: 1 (994/390)
## 
## SubTree [S16]
## 
## global_rate_negative_words <= 0.009920635: 0 (12/1)
## global_rate_negative_words > 0.009920635: 1 (7/2)
## 
## SubTree [S17]
## 
## avg_negative_polarity <= -0.1464286: 1 (57/9)
## avg_negative_polarity > -0.1464286:
## :...num_keywords <= 6: 0 (13)
##     num_keywords > 6: 1 (7/2)
## 
## SubTree [S18]
## 
## data_channel_is_lifestyle = 0: 1 (920/245)
## data_channel_is_lifestyle = 1:
## :...self_reference_min_shares > 210.4756: 1 (10)
##     self_reference_min_shares <= 210.4756:
##     :...self_reference_max_shares > 199.2486: 0 (8)
##         self_reference_max_shares <= 199.2486:
##         :...kw_min_max > 8100: 1 (25/1)
##             kw_min_max <= 8100:
##             :...n_tokens_title > 2.302585: 1 (23/6)
##                 n_tokens_title <= 2.302585:
##                 :...self_reference_max_shares > 179.722: 1 (4)
##                     self_reference_max_shares <= 179.722:
##                     :...num_hrefs <= 4.358899: 0 (23/2)
##                         num_hrefs > 4.358899: 1 (7/2)
## 
## 
## Evaluation on training data (27046 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##     214 8134(30.1%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    9929  3878    (a): class 0
##    4256  8983    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% kw_avg_avg
##   97.78% is_weekend
##   75.75% data_channel_is_socmed
##   63.03% data_channel_is_entertainment
##   61.60% data_channel_is_tech
##   45.40% LDA_02
##   44.65% kw_max_max
##   43.33% self_reference_avg_sharess
##   42.57% n_unique_tokens
##   41.65% kw_min_max
##   37.17% num_self_hrefs
##   35.75% data_channel_is_lifestyle
##   28.77% kw_min_min
##   27.59% num_keywords
##   27.55% kw_max_avg
##   27.54% self_reference_min_shares
##   18.97% n_tokens_content
##   15.37% num_imgs
##   13.01% n_non_stop_unique_tokens
##   11.81% num_hrefs
##   11.10% global_subjectivity
##   10.63% LDA_00
##   10.37% min_positive_polarity
##   10.26% n_tokens_title
##    6.40% num_videos
##    4.68% self_reference_max_shares
##    4.68% min_negative_polarity
##    4.65% max_negative_polarity
##    4.32% LDA_04
##    4.09% average_token_length
##    3.60% kw_min_avg
##    3.45% max_positive_polarity
##    3.14% global_rate_positive_words
##    2.60% abs_title_subjectivity
##    2.02% global_rate_negative_words
##    1.71% title_sentiment_polarity
##    1.58% avg_negative_polarity
##    1.19% avg_positive_polarity
##    0.99% kw_avg_min
##    0.65% abs_title_sentiment_polarity
##    0.45% rate_positive_words
##    0.40% kw_avg_max
##    0.40% global_sentiment_polarity
##    0.16% kw_max_min
##    0.13% LDA_01
##    0.03% title_subjectivity
## 
## 
## Time: 1.8 secs

Predict on the test set

Confusion matrix for the test set
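The prediction and confusion-matrix steps can be sketched as below. This is a minimal, self-contained illustration on the built-in `iris` data with hypothetical object names (`c5_fit`, `test_set`); the report's actual fitted model and news test set are not reproduced in this output.

```r
library(C50)    # C5.0 decision trees
library(caret)  # confusionMatrix()

set.seed(42)
idx       <- sample(nrow(iris), 100)
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

# Fit a C5.0 tree and predict class labels on the held-out set
c5_fit <- C5.0(Species ~ ., data = train_set)
pred   <- predict(c5_fit, newdata = test_set)

# caret's confusionMatrix() produces the cross-table and statistics
cm <- confusionMatrix(pred, test_set$Species)
print(cm$table)
```

`confusionMatrix()` also reports accuracy with a 95% CI, kappa, sensitivity, and specificity, matching the fields in the output below.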

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3773 2160
##          1 1953 3495
##                                           
##                Accuracy : 0.6386          
##                  95% CI : (0.6297, 0.6474)
##     No Information Rate : 0.5031          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.277           
##  Mcnemar's Test P-Value : 0.001318        
##                                           
##             Sensitivity : 0.6589          
##             Specificity : 0.6180          
##          Pos Pred Value : 0.6359          
##          Neg Pred Value : 0.6415          
##              Prevalence : 0.5031          
##          Detection Rate : 0.3315          
##    Detection Prevalence : 0.5213          
##       Balanced Accuracy : 0.6385          
##                                           
##        'Positive' Class : 0               
## 

ROC Curve
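A typical way to draw the ROC curve in R is the pROC package, fed with predicted class-1 probabilities from `predict(model, test, type = "prob")`. The sketch below substitutes synthetic scores, since the report's probability vector is not shown:

```r
library(pROC)

set.seed(1)
# Synthetic stand-ins: true labels and model scores for class "1"
labels <- factor(rep(c(0, 1), each = 500))
scores <- c(rnorm(500, mean = 0.40, sd = 0.15),
            rnorm(500, mean = 0.60, sd = 0.15))

roc_obj <- roc(response = labels, predictor = scores)
plot(roc_obj)  # ROC curve
auc(roc_obj)   # area under the curve
```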

Random Forest

Train the model on the training set
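The component table below is consistent with `randomForest()` run at its default `ntree = 500` on the 27046-row training set with 49 predictors (the `err.rate` component holds 1500 values, i.e. 500 trees × 3 error columns: OOB, class 0, class 1). A minimal sketch of the call, fitted on `iris` here since the news training frame is not reproduced:

```r
library(randomForest)

set.seed(42)
# Sketch on iris; the report fits on the news training set instead
rf_fit <- randomForest(Species ~ ., data = iris, ntree = 500)

summary(rf_fit)        # the component/length/class table shown below
nrow(rf_fit$err.rate)  # one row of error rates per tree grown
```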

##                 Length Class  Mode     
## call                4  -none- call     
## type                1  -none- character
## predicted       27046  factor numeric  
## err.rate         1500  -none- numeric  
## confusion           6  -none- numeric  
## votes           54092  matrix numeric  
## oob.times       27046  -none- numeric  
## classes             2  -none- character
## importance         49  -none- numeric  
## importanceSD        0  -none- NULL     
## localImportance     0  -none- NULL     
## proximity           0  -none- NULL     
## ntree               1  -none- numeric  
## mtry                1  -none- numeric  
## forest             14  -none- list     
## y               27046  factor numeric  
## test                0  -none- NULL     
## inbag               0  -none- NULL     
## terms               3  terms  call

Plot error vs. number of trees

Plot feature importance
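Both plots come directly from the fitted object: `plot()` on a randomForest draws the error rate against the number of trees, and `varImpPlot()` draws per-feature importance. A sketch, again using a hypothetical `rf_fit` trained on `iris`:

```r
library(randomForest)

set.seed(42)
rf_fit <- randomForest(Species ~ ., data = iris, ntree = 500)

plot(rf_fit)        # OOB and per-class error vs. number of trees
varImpPlot(rf_fit)  # mean decrease in Gini per feature
```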

Predict on the test set

Confusion matrix for the test set
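Prediction mirrors the C5.0 step: `predict()` on a randomForest object returns class labels by default (`type = "response"`), which feed straight into `caret::confusionMatrix()`. A self-contained sketch with hypothetical names:

```r
library(randomForest)
library(caret)

set.seed(42)
idx       <- sample(nrow(iris), 100)
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

rf_fit <- randomForest(Species ~ ., data = train_set, ntree = 500)
pred   <- predict(rf_fit, newdata = test_set)  # class labels

cm <- confusionMatrix(pred, test_set$Species)
print(cm$table)
```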

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 3845 1932
##          1 1881 3723
##                                           
##                Accuracy : 0.665           
##                  95% CI : (0.6562, 0.6736)
##     No Information Rate : 0.5031          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.3299          
##  Mcnemar's Test P-Value : 0.4181          
##                                           
##             Sensitivity : 0.6715          
##             Specificity : 0.6584          
##          Pos Pred Value : 0.6656          
##          Neg Pred Value : 0.6643          
##              Prevalence : 0.5031          
##          Detection Rate : 0.3378          
##    Detection Prevalence : 0.5076          
##       Balanced Accuracy : 0.6649          
##                                           
##        'Positive' Class : 0               
## 

ROC Curve

Sonam